Description

[0001] The present invention is related to the subject matter of EP-A-767435. 

BACKGROUND OF THE INVENTION

[0002] The present invention relates, in general, to the field of file systems ("FS") of computer operating systems
("OS"). More particularly, the present invention relates to a single transaction technique for a journaling file system of
a computer operating system in which a journal, or log, contains sequences of file system updates grouped into atomic
transactions which are committed with a single computer mass storage device write operation.

[0003] Modern UNIX(R) OS file systems have significantly increased overall computer system availability through the
use of "journaling" in which a journal, or log, of file system operations is sequentially scanned at boot time. In this
manner, a file system can be brought on-line more quickly than by performing a relatively lengthy check-and-repair step.

[0004] Unfortunately, journaling may nevertheless decrease FS performance in synchronous operations, a type of
operation required for compliance with several operating system standards such as POSIX, SVID
and NFS. Synchronous file system operations are ones in which each operation is treated as a separate transaction
and each such operation requires at least one write to an associated computer mass storage device, or disk drive, per
operation. Stated another way, a synchronous file system operation is one in which all data must be written to disk, or the
transaction "committed", before returning to a particular application program. As such, synchronous operations can
decrease a journaling FS's performance by creating a "bottleneck" at the logging device as each synchronous operation
writes its transaction into the log.

[0005] US-A-5,095,421 discloses a transactional support system which is capable of enhancing a basic operating
system so that a multitude of databases can be simultaneously processed. The transactional support system provides
coordination services which designate the boundaries indicating the success or failure of an executed transaction;
concurrency services provide a locking mechanism for controlling access to resources and deadlock detection in the
event of the imposition of mutual locks; and recovery services maintain a log which ensures that the state of the resource
is preserved in the event of any failures. A transaction is defined as a group of individual operations which can occur
across a number of resources. Further, enhanced terminal handling and transaction scheduling support the large
number of terminals used in a transactional system, by utilizing methods which relieve the resource consumption
associated with a large number of terminals. If a resource operation is successful, the transaction is globally committed, and
the transaction manager becomes responsible for synchronizing the commit operation between all participating
resources. However, US-A-5,095,421 does not disclose a method for committing a single file system transaction to a
mass storage device by means of a single write operation.

SUMMARY OF THE INVENTION

[0006] The present invention provides a method and a computer system for writing data to a computer mass storage 
device in a single write operation in conjunction with a computer operating system having a journaling file system, in 
accordance with claims 1 and 5 which follow. 

[0007] The single transaction technique for journaling file systems disclosed herein is of especial utility in overcoming
the performance degradation which may be experienced in conventional UNIX journaling file systems by entering each
file system operation into the current active transaction. Consequently, each transaction is composed of a plurality of
file system operations which are then simultaneously committed with a single computer mass storage device disk drive
"write". In addition to increasing overall file system performance under even light computer system operational loads,
even greater performance enhancement is experienced under relatively heavy loads.
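
The grouping described in paragraph [0007] can be illustrated with a minimal Python sketch. The class and method names (`Journal`, `run_sync_op`, `commit`) and the in-memory list standing in for the logging device are hypothetical and are not taken from the specification; the sketch only shows several synchronous operations entering one active transaction that is then committed with one device write.

```python
class Journal:
    """Toy journal: synchronous operations join one open transaction."""

    def __init__(self):
        self.device_writes = []  # each element models ONE physical write to the log device
        self.open_txn = []       # deltas accumulated in the currently active transaction

    def run_sync_op(self, name, deltas):
        # Enter the operation's deltas into the current active transaction
        # instead of opening (and committing) one transaction per operation.
        self.open_txn.extend((name, d) for d in deltas)

    def commit(self):
        # Close the transaction and commit every accumulated delta, plus a
        # commit marker, with a single write to the logging device.
        if self.open_txn:
            self.device_writes.append(tuple(self.open_txn) + (("commit", None),))
            self.open_txn = []


journal = Journal()
journal.run_sync_op("mkdir", ["dir inode", "parent dir block"])
journal.run_sync_op("create", ["file inode", "dir entry"])
journal.run_sync_op("write", ["inode mtime"])
journal.commit()
```

In this sketch three operations cost one modeled device write; a per-operation commit would have cost three.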

[0008] In order to effectuate the foregoing, a method is herein disclosed for writing data to a computer mass storage
device in conjunction with a computer operating system having a journaling file system. The method comprises the
steps of opening a single file system transaction for accumulating a plurality of current synchronous file system
operations; performing the plurality of current synchronous file system operations and then closing the single file system
transaction upon completion of a last of the current synchronous file system operations. The single file system
transaction is then committed to the computer mass storage device in a single write operation.

[0009] US-A-5,613,060 (Britton et al) describes an asynchronous resynchronization of a commit procedure providing
resource recovery if there has been a failure during the commit procedure. An application is run on a processor and
requests a work operation involving a resource such as a protected conversation with another application in a different
real machine. A commit procedure is begun for the work request, and if the commit procedure fails before completion,
the following steps are taken to optimize the use of one or both of the applications. At some time after the commit
procedure fails, a return code is sent to at least the application that initiated the commit indicating the result of the
application commit order and that the application can continue to run and does not have to wait for resynchronization
(recovery). Then, while the initiating application continues to run and do other useful work, resynchronization is
implemented in parallel, asynchronously.

[0010] However, US-A-5,613,060 does not describe opening a single file system logging transaction for accumulating
a plurality of current synchronous file system operations, where each synchronous file system operation comprises a
file system operation generated by an external application in which all data must be committed before said application
program code can continue executing. Instead, this prior art deals with synchronizing two file systems that have become
asynchronous due to a failed operation. It describes a "one phase commit" in which there is no writing to the recovery
facility log (Col. 31, line 40). Thus there is no suggestion of the opening of a single file system logging transaction as
in the present invention.

[0011] Furthermore, an important feature of the present invention is that the single logging transaction accumulates
multiple synchronous file system operations before the transaction is closed and written to the mass storage logging
device (i.e., committed).

[0012] US-A-5,613,060 requires (e.g. Col. 7, lines 22-25) that the log be written at each and every phase or step of
the commit procedure. In contrast, the present invention calls for committing the single file system transaction (having
recorded therein a plurality of performed synchronous file system operations) to the mass storage logging device in a
single write operation.

[0013] The present invention is implemented, in part, by adding a journal, or log, to the OS file system including any
System V-based UNIX® OS incorporating a UFS layer or equivalent, the IBM AIX® or Microsoft Windows NT™
operating systems. The journal contains sequences of file system updates grouped into atomic transactions and is
managed by a novel type of metadevice, the metatrans device. The addition of a journal to the operating system provides
faster reboots and fast synchronous writes (e.g. network file system ("NFS"), O_SYNC and directory updates).

[0014] In the specific embodiment disclosed herein, the present invention is advantageously implemented as an
extension to the UFS file system and serves to provide faster synchronous operations and faster reboots through the
use of a log. File system updates are safely recorded in the log before they are applied to the file system itself. The
design may be advantageously implemented into corresponding upper and lower layers. At the upper layer, the UFS
file system is modified with calls to the lower layer that record file system updates. The lower layer consists of a
pseudo-device, the metatrans device, that is responsible for managing the contents of the log.

[0015] The metatrans device is composed of two subdevices, the logging device, and the master device. The logging
device contains the log of file system updates, while the master device contains the file system itself. The existence
of a separate logging device is invisible to user program code and to most of the kernel. The metatrans device presents
conventional block and raw interfaces and behaves like an ordinary disk device.

[0016] Utilizing conventional OS approaches, file systems must be checked before they can be used because shutting
down the system may interrupt system calls that are in progress and thereby introduce inconsistencies. Mounting a
file system without first checking it and repairing any inconsistencies can cause "panics" or data corruption. Checking
is a relatively slow operation for large file systems because it requires reading and verifying the file system meta-data.
Utilizing the present invention, file systems do not have to be checked at boot time because the changes from unfinished
system calls are discarded. As a result, it is ensured that on-disk file system data structures will always remain
consistent, that is, that they do not contain invalid addresses or values. The only exception is that free space may be lost
temporarily if the system crashes while there are open but unlinked files without directory entries. A kernel thread
eventually reclaims this space.

[0017] The present invention also improves synchronous write performance by reducing the number of write
operations and eliminating disk seek time. Writes are smaller because deltas are recorded in the log rather than rewriting
whole file system blocks. Moreover, there are fewer of the blocks because related updates are grouped together into
a single write operation. Disk drive seek time is significantly reduced because writes to the log are sequential.

[0018] As described herein with respect to a specific embodiment of the present invention, UFS on-disk format may
be retained, no changes are required to add logging to an existing UFS file system and the log can subsequently be
removed to return to standard UFS with UFS utilities continuing to operate as before. Additionally, file systems do not
have to be checked for consistency at boot time. The driver must scan the log and rebuild its internal state to reflect
any completed transactions recorded there. The time spent scanning the log depends on the size of the log device but
not on the size of the file system. For reasonably foreseeable configuration choices, scan times on the average of 1-10
seconds per gigabyte of file system capacity may be encountered.

[0019] NFS writes and writes to files opened with O_SYNC are faster because file system updates are grouped
together and written sequentially to the logging device. This means fewer writes and greatly reduced seek time.
Significantly improved speed may be expected at a cost of approximately 50% higher central processor unit ("CPU")
overhead. Also, NFS directory operations are faster because file system updates are grouped together and written
sequentially to the logging device. Local operations are even faster because the logging of updates may optionally be
delayed until sync(), fsync(), or a synchronous file system operation. If no logging device is present, directory operations
may be completed synchronously, as usual.
[0020] If a power failure occurs while a write to the master or logging device is in progress, the contents of the last
disk sector written are unpredictable and may even be unreadable. The log of the present invention is designed so that
no file system metadata is lost under these circumstances. That is, the file system remains consistent in the face of
power failures. In the specific embodiment described in detail herein, users may set up and administer the metatrans
device using standard MDD utilities while the metainit(1m), metaparam(1m), and metastat(1m) commands have small
extensions. Use is therefore simplified because there are no new interfaces to learn and the master device and logging
device together behave like a single disk device. Moreover, more than one UFS file system can concurrently use the
same logging device. This simplifies system administration in some situations.

[0021] In conventional UNIX File System (UFS) implementations, the file system occupies a disk partition, and the
file system code performs updates by issuing read and write commands to the device driver for the disk. With the
extension of the present invention, file system information may be stored in a logical device called a metatrans device,
in which case the kernel communicates with the metatrans driver instead of a disk driver. Existing UFS file systems
and devices may continue to be used without change.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] The aforementioned and other features and objects of the present invention and the manner of attaining them 
will become more apparent and the invention itself will be best understood by reference to the following description of 
a preferred embodiment taken in conjunction with the accompanying drawings, wherein: 

Fig. 1 is a simplified representational drawing of a general purpose computer forming a portion of the operating
environment of the present invention;

Fig. 2 is a simplified representational illustration providing an architectural overview of how selected elements of
the computer program for effectuating a representative implementation of the present invention interact with the
various layers and interfaces of a computer operating system;

Fig. 3 is a more detailed representative illustration of the major functional components of the computer program 
of Fig. 2 showing in greater detail the components of the metatrans device and its interaction through the Vop or 
VFS interface of a System V-based computer operating system in accordance with the exemplary embodiment 
hereinafter described; 

Fig. 4 is a simplified logical block diagram illustrative of the fact that the unit structure for the metatrans devices
contains the address of the logging device unit structure and vice versa;

Fig. 5 is an additional simplified logical block diagram illustrative of the fact that the logging device's unit structures
are maintained on a global linked list anchored by ul_list and that each of the metatrans unit structures for the
metatrans devices sharing a logging device are maintained on a linked list anchored by the logging device's unit
structure;

Fig. 6 is a further simplified logical block diagram showing that the logmap contains a mapentry_t for every delta
in the log that needs to be rolled to the master device and the map entries are hashed by (metatrans device, metatrans
device offset) and maintained on a linked list in the order that they should be rolled in;

Fig. 7 is a simplified logical block diagram showing that the unit structures for the metatrans device and the logging
device contain the address for the logmap;

Fig. 8 is an additional simplified logical block diagram illustrative of the fact that a deltamap is associated with each 
metatrans device and stores the information regarding the changes that comprise a file system operation with the 
metatrans device creating a mapentry for each delta which is stored in the deltamap; 

Fig. 9 is a further simplified logical block diagram showing that, at the end of a transaction, the callback recorded
with each map entry is called and the logmap layer stores the delta plus data in the log's write buffer and puts the
map entries into the logmap;

Fig. 10 is a simplified logical block diagram showing that the logmap is also used for read operations and, if the
buffer being read does not overlap any of the entries in the logmap, then the read operation is passed down to the
master device, otherwise, the data for the buffer is a combination of data from the master device and data from the
logging device;

Fig. 11 illustrates that, early in the boot process, each metatrans device records itself with the UFS function,
ufs_trans_set, creates a ufstrans struct and links it onto a global linked list;

Fig. 12 further illustrates that, at mount time, the file system checks its dev_t against the other dev_t's stored in
the ufstrans structs and, if there is a match, the file system stores the address of the ufstrans struct in its file system
specific per-mount struct (ufsvfs) along with its generic per-mount struct (vfs) in the ufstrans struct; and

Fig. 13 is an additional illustration of the interface between the operating system kernel and the metatrans driver
shown in the preceding figures showing that the file system communicates with the driver by calling entry points
in the ufstransops struct, inclusive of the begin-operation, end-operation and record-delta functions.

DESCRIPTION OF A PREFERRED EMBODIMENT

[0023] The environment in which the present invention is used encompasses the general distributed computing
system, wherein general purpose computers, workstations or personal computers are connected via communication links
of various types, in a client-server arrangement, wherein programs and data, many in the form of objects, are made
available by various members of the system for execution and access by other members of the system. Some of the
elements of a general purpose workstation computer are shown in Fig. 1, wherein a processor 1 is shown, having an
input/output ("I/O") section 2, a central processing unit ("CPU") 3 and a memory section 4. The I/O section 2 is connected
to a keyboard 5, a display unit 6, a disk storage unit 9 and a compact disk read only memory ("CDROM") drive unit 7.
The CDROM unit 7 can read a CDROM medium 8 which typically contains programs 10 and data. The computer
program products containing mechanisms to effectuate the apparatus and methods of the present invention may reside
in the memory section 4, or on a disk storage unit 9 or on the CDROM 8 of such a system.

[0024] With reference now to Fig. 2, a simplified representational view of the architecture 20 for implementing the
present invention is shown in conjunction with, for example, a System V-based UNIX operating system having a user
(or system call) layer 22 and a kernel 24. With modifications to portions of the user layer 22 (i.e. the MDD3 and mount
utilities 28) and kernel 24 (i.e. the UFS layer 30) as will be more fully described hereinafter, the present invention is
implemented primarily by additions to the metatrans layer 26 in the form of a metatrans driver 32, transaction layer 34,
roll code 36, recovery code 38 and an associated log (or journal) code 40.

[0025] The MDD3 Utilities administer the metatrans driver 32 and set up, tear down and give its status. The mount
utilities include a new feature ("-syncdir") which disables the delayed directory updates feature. The UFS layer 30
interfaces with the metatrans driver 32 at mount, unmount and when servicing file system system calls. The primary
metatrans driver 32 interfaces with the base MDD3 driver and the transaction layer 34 interfaces with the primary
metatrans driver 32 and with the UFS layer 30. The roll code 36 rolls completed transactions to the master device and
also satisfies a read request by combining data from the various pieces of the metatrans driver 32. The recovery code
scans the log and rebuilds the log map as will be more fully described hereinafter while the log code presents the upper
layers of the operating system with a byte stream device and detects partial disk drive write operations.
[0026] With reference additionally now to Fig. 3, the major components of the architecture of the present invention
are shown in greater detail. The UFS layer 30 is entered via the VOP or VFS interface 42. The UFS layer 30 changes
the file system by altering incore copies of the file system's data. The incore copies are kept in the buffer or page cache
41. The changes to the incore copies are called deltas 43. UFS tells the metatrans driver 32 which deltas 43 are
important by using the transops interface 45 to the metatrans device 32.

[0027] The UFS layer does not force a write after each delta 43. This would be a significant performance loss. Instead,
the altered buffers and pages are pushed by normal system activity or by UFS at the end of the VOP or VFS interface
42 call that caused the deltas 43. As depicted schematically, the metatrans driver 32 looks like a single disk device to
the upper layers of the kernel 24. Internally, the metatrans driver 32 is composed of two disk devices, the master and
log devices 44, 46. Writes to the metatrans device 32 are either passed to the master device 44 via bdev_strategy or,
if deltas 43 have been recorded against the request via the transops interface 45, then the altered portions of the data
are copied into a write buffer 50 and assigned log space and the request is I/O done. The deltas 43 are moved from
the delta map 48 to the log map 54 in this process.
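
The write routing of paragraph [0027] might be sketched as follows. This is a simplified Python model under stated assumptions: the function name `metatrans_write`, the representation of a delta as an `(offset, length)` tuple, and the use of plain lists and a `bytearray` for the maps, write buffer and master device are all hypothetical, and log space assignment is omitted.

```python
def metatrans_write(offset, data, deltamap, logmap, write_buffer, master):
    """Route one buffer write: pass it to the master, or log the recorded deltas."""
    end = offset + len(data)
    # Deltas recorded against this request are those overlapping [offset, end).
    hits = [d for d in deltamap if d[0] < end and offset < d[0] + d[1]]
    if not hits:
        master[offset:end] = data            # no deltas recorded: pass straight through
        return "master"
    for d_off, d_len in hits:
        lo, hi = max(d_off, offset), min(d_off + d_len, end)
        # Copy only the altered portion of the data into the write buffer.
        write_buffer.append((lo, bytes(data[lo - offset:hi - offset])))
        deltamap.remove((d_off, d_len))      # the delta leaves the delta map ...
        logmap.append((d_off, d_len))        # ... and enters the log map
    return "logged"


master = bytearray(16)
deltamap, logmap, write_buffer = [(4, 2)], [], []
first = metatrans_write(8, b"abcd", deltamap, logmap, write_buffer, master)
second = metatrans_write(0, b"XXXXYYZZ", deltamap, logmap, write_buffer, master)
```

Here the first write touches no recorded delta and goes to the master; the second overlaps the delta at offset 4, so only those two bytes are copied to the write buffer.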

[0028] The write buffer 50 is written to the log device 46 when UFS issues a commit (not shown) at the end of a VOP
or VFS layer 42 call or when the write buffer 50 fills. Not every VOP or VFS layer 42 call issues a commit. Some
transactions, such as lookups or writes to files not opened O_SYNC, simply collect in the write buffer 50 as a single
transaction.

[0029] Reading the metatrans device 32 is somewhat complex because the data for the read can come from any
combination of the write buffers 50, read buffers 52, master device 44, and log device 46. Rolling the data from the
committed deltas 43 forward to the master device 44 appears generally as a "read" followed by a "write" to the master
device 44. The difference is that data can also come from the buffer or page caches 41. The affected deltas 43 are
removed from the log map 54. The roll/read code block 56 is coupled to the master and log devices 44, 46 as well as
the write and read buffers 50, 52 and interfaces to the buffer or page drivers 58.
[0030] With reference now to Fig. 4, it can be seen that early in the boot process, the On-line: Disksuite ("ODS")
state databases are scanned and the incore state for the metadevices is re-created. Each metadevice is represented
by a unit structure and the unit structure for the metatrans devices contains the address of its logging device unit
structure, and vice versa. The metatrans device 60 unit structure is mt_unit_t and is defined in md_trans.h. The logging
device 62 unit structure is ml_unit_t and is also defined in md_trans.h.

[0031] Referring additionally now to Fig. 5, the logging device 62 unit structures are maintained on a global linked
list anchored by ul_list. Each of the metatrans device 60 unit structures for the metatrans devices 60 sharing a logging
device 62 is kept on a linked list anchored by the logging device's unit structure.

[0032] With reference additionally to Fig. 6, after the unit structures are set up, a scan thread is started for each
logging device 62. The scan thread is a kernel thread that scans a log device 62 and rebuilds the logmap 64 for that
logging device 62. The logmap 64 is an mt_map_t and is defined in md_trans.h. The logmap 64 contains a mapentry_t
for every delta 43 in the log that needs to be rolled to the master device. The map entries 68 are hashed by the hash
anchors 66 (metatrans device, metatrans device offset) for fast lookups during read operations. In order to enhance
performance, the map entries 68 are also maintained on a linked list in the order in which they should be rolled in. As
shown schematically in Fig. 7, the unit structures for the metatrans device 60 and the logging device 62 contain the
address of the logmap 64 (log map 54 in Fig. 3), which is associated with the hashed mapentries 70 and all mapentries
72.
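
As an illustration of the two access paths described in paragraph [0032], the following Python sketch keeps each map entry both in a hash bucket keyed by (device, offset) and on a list in roll order. The class `LogMap`, its method names and the dictionary-based entry layout are hypothetical stand-ins, not the mt_map_t or mapentry_t definitions from md_trans.h.

```python
class LogMap:
    """Toy logmap: entries hashed for lookup and also kept in roll order."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]  # models the hash anchors
        self.roll_order = []                          # models the list of all map entries

    def _bucket(self, dev, offset):
        return self.buckets[hash((dev, offset)) % len(self.buckets)]

    def add(self, dev, offset, nbytes, log_offset):
        entry = {"dev": dev, "offset": offset, "nbytes": nbytes, "log_offset": log_offset}
        self._bucket(dev, offset).append(entry)   # fast lookup during reads
        self.roll_order.append(entry)             # order in which deltas are rolled
        return entry

    def lookup(self, dev, offset):
        return [e for e in self._bucket(dev, offset)
                if e["dev"] == dev and e["offset"] == offset]


logmap = LogMap()
a = logmap.add(1, 4096, 128, 0)
b = logmap.add(2, 4096, 64, 128)
c = logmap.add(1, 8192, 32, 192)
```

A read consults `lookup` without walking every entry, while the roll code can walk `roll_order` front to back.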

[0033] Referring also now to Fig. 8, a deltamap 74 is associated with each metatrans device 60. The deltamap 74
stores the information about the changes that comprise a file system operation. The file system informs the metatrans
device 60 about these changes (or deltas 43) by recording the tuple (offset on master device 44, number of bytes of data
and callback) with the device. The metatrans device 60 in conjunction with hash anchors 76 creates a mapentry 78
for each delta 43 which is stored in the deltamap 74 (delta map 48 in Fig. 3). The deltamap 74 is an mt_map_t like the
logmap 64 (Figs. 6-7) and has the same structure.

[0034] With reference also to Fig. 9, at the end of a transaction, the callback recorded with each map entry 68 is
called in the case of "writes" involving logged data. The callback is a function in the file system that causes the data
associated with a delta 43 to be written. When this "write" appears in the metatrans driver, the driver detects an overlap
between the buffer being written 80 and deltas 43 in the deltamap 74. If there is no overlap, then the write is passed
on to the master device 44 (Fig. 3). If an overlap is detected, then the overlapping map entries are removed from the
deltamap 74 and passed down to the logmap layer.

[0035] The logmap layer stores the delta 43 + data in the log's write buffer 50 and puts the map entries into the
logmap 64. It should be noted that the data for a delta 43 may have been written before the end of a transaction and,
if so, the same process is followed. Once the data is copied into the log's write buffer 50, the buffer is iodone'd.

[0036] Among the reasons for using the mt_map_t architecture for the deltamap 74 is that the driver cannot use
kmem_alloc. The memory for each entry that may appear in the logmap needs to be allocated before the buffer appears
in the driver. Since there is a one-to-one correspondence between deltas 43 in the deltamap 74 and the entries in the
logmap 64, it is apparent that the deltamap entries 78 should be the same as the logmap entries 68.
[0037] Referring now to Fig. 10, the analogous situation of "reads" involving logged data is illustrated. As can be
seen, the logmap 64 is also used for read operations. If the buffer being read does not overlap any of the entries 68
in the logmap 64, then the "read" is simply passed down to the master device 44. On the other hand, if the buffer does
overlap entries 68 in the logmap 64, then the data for the buffer is a combination of data from the master device 44
and data from the logging device 46.
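
The read path of paragraph [0037] can be modeled in a few lines of Python. This is a sketch under assumptions: the function name `metatrans_read` and the `(master_offset, nbytes, log_offset)` entry layout are hypothetical, the logmap is a plain list in roll order, and the master and log devices are byte strings.

```python
def metatrans_read(offset, length, master, log, logmap):
    """Return the current contents of [offset, offset + length)."""
    buf = bytearray(master[offset:offset + length])   # start from the master device
    for m_off, nbytes, log_off in logmap:             # later entries override earlier ones
        lo = max(m_off, offset)
        hi = min(m_off + nbytes, offset + length)
        if lo < hi:                                   # entry overlaps the buffer
            src = log_off + (lo - m_off)
            buf[lo - offset:hi - offset] = log[src:src + (hi - lo)]
    return bytes(buf)


master = b"0123456789"
log = b"ABCD"
logmap = [(3, 2, 0), (8, 2, 2)]   # bytes 3-4 and 8-9 have newer copies in the log
overlapping = metatrans_read(2, 5, master, log, logmap)
untouched = metatrans_read(0, 2, master, log, logmap)
```

When no entry overlaps, the loop changes nothing and the result is exactly what the master device holds.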

[0038] With reference to Figs. 11 and 12, the situation at mount time is illustrated schematically. Early in the boot
process, each metatrans device records itself with the UFS function, ufs_trans_set, and creates a ufstrans struct 84
and links it onto a global linked list. At mount time, the file system checks its dev_t against the dev_t's stored in the
ufstrans structs 86. If there is a match, then the file system stores the address of the ufstrans struct 86 in its file system
specific per-mount struct, the ufsvfs 90. The file system also stores its generic per-mount struct, the vfs 88, in the
ufstrans struct 86. This activity is accomplished by mountfs() and by ufs_trans_get(). The address of the vfs 88 is stored
in the ufstrans struct 86 due to the fact that the address is required by various of the callback functions.
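
The registration and mount-time matching of paragraph [0038] can be sketched in Python as below. The function names `ufs_trans_set`, `ufs_trans_get` and `mountfs` come from the text, but their signatures, the dictionary-based structs and the tuple used as a dev_t are assumptions made for illustration only.

```python
ufstrans_list = []   # models the global linked list of ufstrans structs


def ufs_trans_set(dev):
    """Called by each metatrans device early in boot to record itself."""
    ufstrans = {"dev": dev, "vfs": None}
    ufstrans_list.append(ufstrans)
    return ufstrans


def ufs_trans_get(dev, vfs):
    """Find the ufstrans struct whose dev_t matches and store the vfs in it."""
    for ufstrans in ufstrans_list:
        if ufstrans["dev"] == dev:
            ufstrans["vfs"] = vfs    # callbacks later need the vfs address
            return ufstrans
    return None                      # not a metatrans device: plain UFS


def mountfs(dev):
    vfs = {"dev": dev}                                        # generic per-mount struct
    ufsvfs = {"vfs": vfs, "trans": ufs_trans_get(dev, vfs)}   # UFS per-mount struct
    return ufsvfs


registered = ufs_trans_set((85, 1))
ufs_trans_set((85, 2))
logged_mount = mountfs((85, 1))
plain_mount = mountfs((32, 0))
```

A mount on a device that never registered simply gets no ufstrans reference and proceeds as standard UFS.
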
[0039] The file system communicates with the metatrans driver 32 (Figs. 2-3) by calling the entry points in the
ufstransops 92 struct. These entry points include the begin-operation, end-operation and record-delta functions. Together,
these three functions perform the bulk of the work needed for transacting UFS layer 30 operations. Fig. 13 provides a
summary of the data structures of the present invention as depicted in the preceding figures and as will be more fully
described hereinafter.
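
A minimal Python sketch of the three entry points named in paragraph [0039] follows. The class `TransOps` and its method signatures are hypothetical, and the rule that a commit record is appended when the last active operation ends is an assumption of this sketch rather than a statement of the specification.

```python
class TransOps:
    """Toy transaction interface with begin, record-delta and end entry points."""

    def __init__(self):
        self.active = 0   # number of file system operations currently in progress
        self.log = []     # records destined for the logging device

    def begin_operation(self):
        self.active += 1

    def record_delta(self, offset, nbytes):
        if self.active == 0:
            raise RuntimeError("delta recorded outside an operation")
        self.log.append(("delta", offset, nbytes))

    def end_operation(self):
        self.active -= 1
        if self.active == 0:
            # Assumption: the transaction completes when no operation remains active.
            self.log.append(("commit",))


ops = TransOps()
ops.begin_operation()          # operation A starts
ops.record_delta(1024, 128)
ops.begin_operation()          # operation B overlaps A
ops.record_delta(4096, 16)
ops.end_operation()            # one operation ends, the other is still active
ops.end_operation()            # last active operation ends: commit record
```

Both overlapping operations end up under a single commit record in this model.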

[0040] The metatrans device, or driver 32, contains two underlying devices, a logging device 46 and a master device
44. Both of these can be disk devices or metadevices (but not metatrans devices). Both are under control of the
metatrans driver and should generally not be accessible directly by user programs or other parts of the system. The logging
device 46 contains a journal, or log. The log is a sequence of records each of which describes a change to a file system
(a delta 43). The set of deltas 43 corresponding to the currently active vnode operations form a transaction. When a
transaction is complete, a commit record is placed in the log. If the system crashes, any uncommitted transactions
contained in the log will be discarded on reboot. The log may also contain user data that has been written synchronously
(for example, by NFS). Logging this data improves file system performance, but is not mandatory. If sufficient log space
is not available, user data may be written directly to the master device 44. The master device 44 contains a UFS file
system in the standard format. If a device that already contains a file system is used as the master device 44, the file
system contents will be preserved, so that upgrading from standard UFS to the extension of the present invention is
straightforward. The metatrans driver updates the master device 44 with completed transactions and user data. Metaclear(1m)
dissolves the metatrans device 32, so that the master device 44 can again be used with standard UFS if desired.
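
The crash behavior described in paragraph [0040], in which uncommitted transactions are discarded on reboot, can be illustrated with a short Python sketch. The function name `recover` and the tuple-based record layout are hypothetical; a real log format would also have to detect partially written records.

```python
def recover(log_records):
    """Return only the deltas that are followed by a commit record."""
    committed, pending = [], []
    for record in log_records:
        if record[0] == "delta":
            pending.append(record)      # not yet known to be committed
        elif record[0] == "commit":
            committed.extend(pending)   # the whole transaction becomes durable
            pending = []
    return committed                    # anything left in pending is discarded


crashed_log = [
    ("delta", 0, b"a"),
    ("delta", 8, b"b"),
    ("commit",),
    ("delta", 16, b"c"),                # crash happened before this was committed
]
recovered = recover(crashed_log)
```

Only the two deltas ahead of the commit record survive the scan.
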
[0041] The metatrans device 32 presents conventional raw and block interfaces and behaves like an ordinary disk
device. A separate transaction interface allows the file system code to communicate file system updates to the driver.
The contents of the device consist of the contents of the master device 44, modified by the deltas 43 recorded in the log.
[0042] Through the transaction interface, UFS informs the driver what data is changing in the current transaction (for
instance, the inode modification time) and when the transaction is finished. The driver constructs log records containing
the updated data and writes them to the log. When the log becomes sufficiently full, the driver rolls it forward. In order
to reuse log space, the completed transactions recorded in the log must be applied to the master device 44. If the data
modified by a transaction is available in a page or buffer in memory, the metatrans driver simply writes it to the master
device 44. Otherwise, the data must be read from the metatrans device 32. The driver reads the original data from the
master device 44, then reads the deltas 43 from the log and applies them before writing the updated data back to the
master device 44. The effective caching of SunOS™ developed and licensed by Sun Microsystems, Inc., makes the
latter case occur only rarely and in most instances, the log is written sequentially and is not read at all.
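
The two roll-forward cases of paragraph [0042] can be sketched as follows. The function name `roll_forward`, the 8-byte block size, the `(master_offset, nbytes, log_offset)` entry layout and the dictionary used as the in-memory cache are assumptions of this Python sketch, which also assumes that a delta never crosses a block boundary.

```python
BLOCK = 8   # hypothetical block size, kept tiny for the example


def roll_forward(master, log, logmap, cache):
    """Apply committed deltas to the master; return how many needed a log read."""
    log_reads = 0
    for m_off, nbytes, log_off in logmap:
        blk = (m_off // BLOCK) * BLOCK
        if blk in cache:
            block = bytearray(cache[blk])                # current data already in memory
        else:
            block = bytearray(master[blk:blk + BLOCK])   # read the original data
            start = m_off - blk
            block[start:start + nbytes] = log[log_off:log_off + nbytes]  # apply the delta
            log_reads += 1
        master[blk:blk + BLOCK] = block                  # write the updated block back
    logmap.clear()                                       # log space can now be reused
    return log_reads


master = bytearray(b"a" * 16)
log = b"XYZ"
logmap = [(2, 2, 0), (9, 1, 2)]
cache = {0: b"aaXYaaaa"}          # block 0 is still cached with its delta applied
log_reads = roll_forward(master, log, logmap, cache)
```

With an effective cache most entries take the first branch, which matches the observation that the log is rarely read.
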
[0043] UFS may also cancel previous deltas 43 because a subsequent operation has nullified their effect. This
canceling is necessary when a block of metadata, for instance, an allocation block, is freed and subsequently reallocated
as user data. Without canceling, updates to the old metadata might be erroneously applied to the user data.

[0044] The metatrans driver keeps track of the log's contents and manages its space. It maintains the data structures
for transactions and deltas 43 and keeps a map that associates log records with locations on the master device 44. If
the system crashes, these structures are reconstructed from the log the next time the device is used (but uncommitted
transactions are ignored). The log format ensures that partially written records or unused log space cannot be mistaken
for valid transaction information. A kernel thread is created to scan the log and rebuild the map on the first read or write
on a metatrans device 32. Data transfers are suspended until the kernel thread completes, though driver operations
not requiring I/O may proceed.

[0045] One of the principal benefits of the present invention is the protection of metadata against corruption by power failure.
This imposes a constraint on the contents of the log in the case where power fails while the metatrans driver is applying
a delta 43 to the master device 44. In this case, the file system object that is being updated may be partially written
or even corrupted, and the entire contents of the object must still be recoverable from the log. To accomplish this, the driver
guarantees that a copy of the object is in the log before the object is written to the master device 44.
[0046] The metatrans device 32 does not attempt to correct other types of media failure. For instance, a device error
while writing or reading the logging device 46 puts the metatrans device 32 into an exception state. The metatrans
device 32's state is kept in the MDD database. There are different exception states based on when the error occurs
and the type of error.

[0047] Metatrans device 32 configuration may be performed using standard MDD utilities. The MDD dynamic concatenation
feature allows dynamic expansion of both the master and logging devices 44, 46. The device configuration
and other state information is stored in the MDD state database, which provides replication and persistence across
reboots. The space required to store the information is relatively small, on the order of one disk sector per metatrans
device 32.

[0048] In a particular implementation of the present invention, UFS checks whether a file system resides on a metatrans
device 32 at mount time by calling ufs_trans_get(). If the file system is not on a metatrans device 32, this function
returns NULL; otherwise, it returns a handle that identifies the metatrans device 32. This handle is saved in the mount
structure for use in subsequent transaction operations. The functions TRANS_BEGIN() and TRANS_END() indicate
the beginning and end of transactions. TRANS_DELTA() identifies a change to the file system that must be logged.
TRANS_CANCEL() lets UFS indicate that previously logged deltas 43 should be canceled because a file system data
structure is being recycled or discarded.
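
The bracketing of a metadata update by the calls named in paragraph [0048] can be illustrated with a small stand-alone sketch. Everything here is a simplified stand-in: struct trans_handle, trans_begin(), trans_delta(), trans_end() and touch_inode() are hypothetical names that mimic, but are not, the TRANS_BEGIN(), TRANS_DELTA() and TRANS_END() interfaces, and the counters replace real logging.

```c
#include <stddef.h>

/* Simplified stand-in for the metatrans handle; the real structure is
 * private to the driver. */
struct trans_handle {
    int open;       /* nesting depth of open transactions */
    int deltas;     /* deltas declared in the current transaction */
    int committed;  /* completed outermost transactions */
};

/* A NULL handle means the file system is not on a metatrans device, so
 * every call is a no-op, mirroring the NULL return of ufs_trans_get(). */
static void
trans_begin(struct trans_handle *h)
{
    if (h != NULL)
        h->open++;
}

static void
trans_delta(struct trans_handle *h)
{
    if (h != NULL && h->open > 0)
        h->deltas++;
}

static void
trans_end(struct trans_handle *h)
{
    if (h != NULL && h->open > 0 && --h->open == 0) {
        h->committed++;     /* outermost transaction is complete */
        h->deltas = 0;
    }
}

/* Hypothetical operation: update an inode time stamp in a transaction. */
static void
touch_inode(struct trans_handle *h, long *mtime, long now)
{
    trans_begin(h);
    *mtime = now;       /* change the metadata ... */
    trans_delta(h);     /* ... and declare the change for logging */
    trans_end(h);
}
```

The same operation code runs unchanged whether or not a handle is present, which is the property the NULL-handle convention is meant to provide.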

[0049] When the file system check ("fsck") utility is run on a file system in accordance with the present invention, it
checks the file system's clean flag in the superblock and queries the file system device via an ioctl command. When
both the superblock and device agree that the file system is on a metatrans device 32, and the device does not report
any exception conditions, fsck is able to skip further checking. Otherwise, it checks the file system in a conventional
manner.

[0050] When the "quotacheck" utility is run on a file system in accordance with the present invention, it checks the
file system's clean flag in the superblock and queries the file system device via an ioctl command. When both the superblock
and device agree that the file system is on a metatrans device 32, and the device does not report any exception
conditions, quotacheck does not have to rebuild the quota file. Otherwise, it rebuilds the quota file for the file system in
a conventional manner.

[0051] The logging mechanism of the present invention ensures file system consistency with the exception of lost
free space. If there were open but deleted files (that is, files not referred to by any directory entry) when the system went
down, the file system resources claimed by these files will be temporarily lost. A kernel thread will reclaim these resources
without interrupting service. As a performance optimization, a previously unused field in the file system's
superblock, fs_sparecon[53], indicates whether any files of this kind exist. If desired, fsck can reclaim the lost space
immediately. The field fs_sparecon[53] will be renamed fs_reclaim.
[0052] Directories may be changed by a local application or by a daemon running on behalf of a remote client in a
client-server computer system. In the standard UFS implementation, both remote and local directory changes are made
synchronously, that is, updates to a directory are written to the disk before the request returns to the application or
daemon. Local directory operations are synchronous so that the file system can be automatically repaired at boot time.
The NFS protocol requires synchronous directory operations. Using the technique of the present invention, remote
directory changes are made synchronously but local directory changes are held in memory and are not written to the
log until a sync(), fsync(), or a synchronous file system operation forces them out. As a result, local directory changes
can be lost if the system crashes but the file system remains consistent. Local directory changes remain ordered.
[0053] Holding the local directory updates in memory greatly improves performance. This introduces a change in file
system semantics, since completed directory operations may now disappear following a system crash. However, the
old behavior is not mandated by any standard, and it is expected that few, if any, applications would be affected by the
change. This feature is implemented in conventional file systems, such as Veritas, Episode, and the log-structured file
system of Ousterhout and Rosenblum. Users can optionally revert to synchronous local directory updates.
[0054] The MDD initialization utility, metainit(1m), may be extended to accept configuration lines of the following
form:

mdNN -t master log [-n]

mdNN    A metadevice name that will represent the metatrans device.
master  The master device; a metadevice or ordinary disk device.
log     The log device; a metadevice or ordinary disk device. The same log may be used in multiple metatrans
        devices, in which case it is shared among them.

[0055] Metastat may also be extended to display the status of metatrans devices, with the following format:

mdXX: metatrans device
      Master device:  mdYY
      Logging device: mdZZ
      <state information>

mdYY: metamirror, master device for mdXX
      <usual status>

mdZZ: metamirror, logging device for mdXX
      <usual status>

[0056] Fsck decides whether to check file systems based on the state of the clean flag. The specific implementation of
the present invention described herein defines a new clean flag value, FSLOG. If the clean flag is FSLOG and the
metatrans device 32 is not in an exception state, "fsck -m" exits with 0 and checking is skipped. Otherwise, the clean
flag is handled in a conventional manner. Fsck checks the state of the metatrans device 32 with a project-private
ioctl request. After successfully repairing a file system, fsck will issue a project-private ioctl request that takes the
metatrans device 32 out of the exception state.

[0057] If the clean flag is FSLOG and the metatrans device 32 is not in an exception state, then quotacheck skips
the file system. Otherwise, quotacheck rebuilds the quota file in a conventional manner. Quotacheck checks the state
of the metatrans device 32 with a project-private ioctl request. After successfully repairing a file system, quotacheck
will issue a project-private ioctl request that resets metatrans device 32's exception state.
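
Paragraphs [0056] and [0057] apply the same test in fsck and quotacheck. A minimal sketch of that test follows; the enumerators CF_OTHER and CF_FSLOG and the function may_skip_check() are hypothetical, and the numeric values are assumptions, since the text does not give the value of FSLOG.

```c
#include <stdbool.h>

/* Illustrative clean-flag values; the real FSLOG value is not stated. */
enum { CF_OTHER = 0, CF_FSLOG = 1 };

/*
 * Return true when a full check (fsck) or quota rebuild (quotacheck)
 * can be skipped: the superblock says the file system is logged and
 * the metatrans device reports no exception state.
 */
static bool
may_skip_check(int clean_flag, bool in_exception)
{
    return (clean_flag == CF_FSLOG && !in_exception);
}
```

In any other combination the utility falls back to its conventional behavior.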

[0058] The ufs_mount program may accept a pair of new options to control whether or not to use delayed directory 
updates. 






Header files

[0059]

<sys/fs/ufs_inode.h>
    struct ufsvfs may contain a pointer to struct metatrans to identify the metatrans device.
    i_doff is added to struct inode.

<sys/fs/ufs_quota.h>
    struct dquot may have the new field dq_doff.

<sys/fs/ufs_fs.h>
    The new clean flag value FSLOG is defined here. fs_sparecon[53] is renamed fs_reclaim.

<sys/fs/ufs_trans.h>
<sys/md_trans.h>

[0060] These are new header files that define project-private interfaces, e.g., metatrans ioctl commands and data
structures.

Kernel Interfaces

common/fs/ufs/*.c

[0061] The VOP and VFS interfaces to UFS need not change unless a flag is added to the directory VOP calls to
distinguish local and remote access. Calls to the metatrans logging interface are added to numerous internal UFS
functions.

common/vm/page_lock.c

[0062] The following functions allow conditional access to a page: page_io_lock(), page_io_unlock(),
page_io_trylock(), and page_io_assert().

common/vm/vm_pvn.c

[0063] The following function allows release of the pages acquired using the preceding functions: pvn_io_done().

common/os/bio.c

[0064] A new function, trygetblk(), is added to bio.c. This function checks whether a buffer exists for the specified
device and block number and is immediately available for writing. If these conditions are satisfied, it returns a pointer
to the buffer header; otherwise, it returns NULL.
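
The non-blocking lookup described in paragraph [0064] can be illustrated with a toy buffer cache. This is a sketch under stated assumptions: struct buf here is a reduced, hypothetical structure, try_get_buf() is an illustrative name rather than the trygetblk() implementation, and a linear scan stands in for the real hash lookup.

```c
#include <stddef.h>

/* Reduced, hypothetical buffer header. */
struct buf {
    int dev;
    long blkno;
    int busy;   /* nonzero while another thread owns the buffer */
};

/*
 * Return the buffer for (dev, blkno) only if it is cached and not busy.
 * The caller never waits: a miss or a busy buffer yields NULL.
 */
static struct buf *
try_get_buf(struct buf *cache, size_t n, int dev, long blkno)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (cache[i].dev == dev && cache[i].blkno == blkno) {
            if (cache[i].busy)
                return (NULL);  /* present but not available */
            cache[i].busy = 1;  /* claim it for the caller */
            return (&cache[i]);
        }
    }
    return (NULL);              /* not cached */
}
```

A NULL return tells the roll logic to fall back to reading the data from the log instead of sleeping on the buffer.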

[0065] Thread-specific data ("TSD") may be utilized for testing. Each delta 43 in a file system operation will be associated
with the thread that is causing the delta 43.
[0066] UFS mount stores the value returned by ufs_trans_get() in the ufsvfs field vfs_trans. A NULL value means
that the file system is not mounted from a metatrans device 32. UFS functions as usual in this case. A non-NULL value
means the file system is mounted from a metatrans device 32. In this case:

a) The on-disk clean flag is set to FSLOG and further clean flag processing is disabled by setting the in-core clean
flag to FSBAD. Disabling clean flag processing saves CPU overhead.

b) The DIO flag is set unless the "nosyncdir" mount option is specified. Local directory updates will be recorded
with a delayed write. A crash could lose these operations. Remote directory operations remain synchronous. Directory
operations are considered remote when T_DONTPEND is set in curthread->t_flag.

c) An exception routine is registered with the metatrans device 32 at mount time. The metatrans driver calls this
routine when an exception condition occurs. Exception conditions include device errors and detected inconsistencies
in the driver's state. The UFS exception routine will begin a kernel thread that hard locks the affected file
systems.

[0067] Each UFS Vnode or VFS operation may generate one or more transactions. Transactions may be nested,
that is, a transaction may contain subtransactions that are contained entirely within it. Nested transactions occur when
an operation triggers other operations. Typically, each UFS operation has one transaction (plus any nested transactions)
associated with it. However, certain operations such as VOP_WRITE and VFS_SYNC are divided into multiple transactions
when a single transaction would exceed the total size of the logging device 46. Others, such as VOP_CMP and
VOP_ADDMAP, do not generate any transactions because they never change the file system state. Some operations
that do not directly alter the file system may generate transactions as a result of side effects. For example,
VOP_LOOKUP may replace an entry in the DNLC or inode cache, causing in-core inodes to become inactive and the
pages associated with them to be written to disk.

[0068] Transactions begin with a call to TRANS_BEGIN(). The transaction terminates when TRANS_END() is called.
A transaction is composed of deltas 43, which are updates to the file system's metadata. Metadata comprises the superblock,
summary information, cylinder groups, inodes, allocation blocks, and directories. UFS identifies the deltas 43 for the
metatrans device 32 by calling TRANS_DELTA(). This call identifies a range of bytes within a buffer that should be
logged. These bytes are logged when the buffer is written. UFS often alters the same metadata many times for a single
operation. Separating the declaration of the delta 43 from the logging of the delta 43 collapses multiple updates into
one delta 43.
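
The collapsing of repeated declarations into a single delta, as described in paragraph [0068], can be sketched as a merge of byte ranges. The structures and the function declare_delta() below are hypothetical, and the fixed-size array is an assumption made to keep the example self-contained; the driver's real bookkeeping is not described at this level of detail.

```c
#include <stddef.h>

#define MAXRANGES 16

struct range {
    size_t off;     /* first byte of the range */
    size_t end;     /* one past the last byte */
};

/* Set of pending deltas for one buffer; ranges never overlap or touch. */
struct deltaset {
    struct range r[MAXRANGES];
    int n;
};

/*
 * Declare that bytes [off, off + nb) changed. Existing ranges that
 * overlap or touch the new one are absorbed into it, so repeated
 * updates of the same metadata leave a single range to log.
 * Returns 0 on success, -1 if the set is full.
 */
static int
declare_delta(struct deltaset *s, size_t off, size_t nb)
{
    struct range nr;
    int i, j = 0;

    nr.off = off;
    nr.end = off + nb;
    for (i = 0; i < s->n; i++) {
        if (s->r[i].off <= nr.end && nr.off <= s->r[i].end) {
            /* overlapping or adjacent: grow the new range */
            if (s->r[i].off < nr.off)
                nr.off = s->r[i].off;
            if (s->r[i].end > nr.end)
                nr.end = s->r[i].end;
        } else {
            s->r[j++] = s->r[i];    /* unrelated range: keep it */
        }
    }
    if (j == MAXRANGES)
        return (-1);
    s->r[j++] = nr;
    s->n = j;
    return (0);
}
```

When the buffer is finally written, only the surviving ranges need to be copied into log records.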

[0069] UFS obtains disk blocks for user data and allocation blocks from the same free pool. As a result, user data
may occupy locations on disk that contained metadata at some earlier time. The log design must ensure that during
recovery, the user data is not incorrectly updated with deltas 43 to the former metadata. UFS prevents this by calling
TRANS_CANCEL() whenever a block is allocated for user data.
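
The effect of the cancel call in paragraph [0069] can be illustrated with a toy log in which records overlapping a reallocated block are marked so that recovery skips them. struct logrec and cancel_deltas() are hypothetical names; the actual driver records a cancel in the log rather than editing earlier records in place, so this is a simplification.

```c
/* Hypothetical in-memory view of one logged delta. */
struct logrec {
    long off;       /* byte offset on the master device */
    long nb;        /* length in bytes */
    int canceled;   /* nonzero: must not be applied at recovery */
};

/*
 * Mark every live record that overlaps [off, off + nb) as canceled.
 * Returns the number of records canceled by this call.
 */
static int
cancel_deltas(struct logrec *log, int n, long off, long nb)
{
    int i, hits = 0;

    for (i = 0; i < n; i++) {
        if (!log[i].canceled &&
            log[i].off < off + nb && off < log[i].off + log[i].nb) {
            log[i].canceled = 1;
            hits++;
        }
    }
    return (hits);
}
```

Records outside the reallocated block are untouched, so unrelated metadata updates are still applied during roll-forward.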

[0070] Writes to the raw or block metatrans device 32 can invalidate information recorded in the log. To avoid inconsistencies,
the driver transacts these writes.

[0071] The logging device 46 increases synchronous write performance by batching synchronous writes together
and by writing the batched data to the logging device 46 sequentially. The data is written asynchronously to the master
device 44 at the same time. The synchronous write data recorded in the log is not organized into transactions. The
metatrans device 32 transparently logs synchronous write data without intervention at the file system level. Synchronously
written user data is not logged when there is insufficient free space in the log. In this case, an ordinary
synchronous write to the master device 44 is done.

[0072] When synchronous write data is logged, any earlier log records for the same disk location must be canceled
to avoid making incorrect changes to the data during recovery or roll-forward. When the asynchronous write of the
data to the master device 44 has finished, the metatrans driver's done routine places a cancel record on a list of items
to be logged. Subsequent synchronous writes to the same disk location are followed by a synchronous commit that
flushes this record to the log and cancels the previous write. Subsequent asynchronous writes to the same location
will disappear at reboot unless they are followed by a sync(), fsync() or further synchronous update. The correctness
of this scheme depends on the fact that UFS will not start a new write to a disk location while a preceding one is still
in progress.

[0073] The master device 44 is periodically updated with the committed changes in the log. Changes recorded at
the head of the log are rolled first. Three performance measures reduce the overhead of rolling the log. First, the driver
avoids reading the log when the required data is available, either in the buffer cache or in the page cache. Two new
routines, trygetblk() and ufs_trypage(), return a buffer header or a page without sleeping or they return NULL. Second,
overlapping deltas 43 are canceled. If the log contains multiple updates for the same data, only the minimum set
required is read from the log and applied. The third measure involves the untransacted synchronous write data. This
data is written synchronously to the logging device 46 and asynchronously to the master device 44. The roll logic simply
waits for the asynchronous write to complete.

[0074] Rolling is initiated by the metatrans driver. When the logging device 46 fills, the metatrans driver immediately
rolls the log in the context of the current thread. Otherwise, the metatrans driver heuristically determines when rolling
would be efficient and it starts a kernel thread. An obvious heuristic for this case is when the metatrans driver has been
idle for several seconds. The log is not rolled forward at fsync(), sync() or unmount but is rolled when the metatrans
device 32 is cleared by the metaclear(1m) utility.
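
The two triggers for rolling described in paragraph [0074] can be summarized in a small decision function. This is a sketch only: roll_decide() and the enumerators are hypothetical names, and IDLE_ROLL_SECS is an assumed value standing in for "several seconds", which the text does not quantify.

```c
#include <stddef.h>

#define IDLE_ROLL_SECS 5    /* assumed value for "several seconds" */

enum roll_action {
    ROLL_NONE,      /* nothing to do yet */
    ROLL_NOW,       /* roll in the context of the current thread */
    ROLL_THREAD     /* start a kernel thread to roll in the background */
};

/*
 * Decide how to roll the log given its fill level and how long the
 * driver has been idle.
 */
static enum roll_action
roll_decide(size_t log_used, size_t log_size, unsigned idle_secs)
{
    if (log_used >= log_size)
        return (ROLL_NOW);          /* log is full: cannot wait */
    if (log_used > 0 && idle_secs >= IDLE_ROLL_SECS)
        return (ROLL_THREAD);       /* idle heuristic */
    return (ROLL_NONE);
}
```

Rolling in the current thread only when the log is full keeps the common path free of roll overhead, while the idle heuristic drains the log when it costs the least.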

[0075] The metatrans device 32 puts itself into an exception state if an error occurs that may cause loss of data. In
this state, the metatrans device 32 returns EIO on each read or write after calling all registered "callback-on-exception"
routines for the device. UFS registers a callback-on-exception routine at mount time. The UFS routine starts a kernel thread that
hard locks the affected UFS file systems, allowing manual recovery. The usual procedure is to unmount the file system,
fix the error, and run fsck. Fsck takes the device out of the exception state after it repairs the file system. The file system
can then be mounted, and the file system functions as normal. If the file system is unmounted and then mounted again
without running fsck, any write to the device returns EIO but reads will proceed if the requested data can be accessed.
[0076] UFS must not exhaust log space and, if the metatrans driver cannot commit a transaction because of insufficient
log space, it treats the condition as a fatal exception. UFS avoids this situation by splitting certain operations
into multiple transactions when necessary. The UFS flush routines create a transaction for every ufs_syncip() or
VOP_PUTPAGE call. The flush routines are ufs_flushi(), ufs_iflush(), and ufs_flush_icache(). The affected UFS operations
are VFS_SYNC and VFS_UNMOUNT and the UFS ioctls FIOLFS, FIOFFS, and FIODIO. A VOP_WRITE operation
is split into multiple rwip() calls in ufs_write().

[0077] Freeing a file in ufs_iinactive() cannot be split into multiple transactions because of deadlock problems with
transaction collisions and recursive UFS operations, so freeing of the file is delayed until there is no chance of deadlock.
[0078] The metatrans driver does not recover the resources held by open, deleted files at boot. Instead, UFS manages
this problem. A kernel thread created at mount time scans for deleted files if:

a) The file system is on a metatrans device 32, or

b) The superblock says there are deleted files. A bit in a previously unused spare in the superblock indicates
whether any such files are present.

[0079] The metatrans device 32 driver handles three classes of errors: "device errors", "database errors", and "internal
errors". Device errors are errors in reading or writing the logging or master devices 46, 44. Database errors are
errors reported by MDD's database routines. Internal errors are detected inconsistencies in internal structures, including
structures written onto the logging device 46.

[0080] A mounted metatrans device 32 responds to errors in one of two ways. The metatrans driver passes errors
that do not compromise data integrity up to the caller without any other action. For instance, this type of error can occur
while reading unlogged data from the master device 44. The metatrans device 32 puts itself into an exception state
whenever an error could result in lost or corrupted data, for example, an error reading or writing the logging device 46
or an error from MDD's database routines. A metatrans device 32 puts itself into an exception state by:

a) Recording the exception in MDD's database, when possible.

b) Calling any registered "callback-on-exception" routines. These routines are registered with the device at mount
time. UFS registers a routine that starts a kernel thread that hard locks the affected UFS file systems. These file
systems can be unmounted and then remounted after the exception condition has been corrected.

c) Returning EIO for every read or write call while the metatrans device 32 is mounted.

[0081] After the metatrans device 32 is released by UFS at unmount with ufs_trans_put(), reads return EIO when
the requested data cannot be accessed and writes always return EIO. This behavior persists even after the metatrans
device 32 is mounted again.
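
The I/O behavior of a device in the exception state, as set out in paragraphs [0080] and [0081], reduces to a small truth table. The function exception_io_status() and its parameters are hypothetical; only the EIO error code comes from the text.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Status returned for an I/O request on a metatrans device that is in
 * the exception state.
 *
 * still_mounted:   the mount during which the exception was raised is
 *                  still active (UFS has not yet released the device)
 * is_write:        the request is a write
 * data_accessible: for reads, the requested data can still be read
 *
 * Returns 0 if the request may proceed, EIO otherwise.
 */
static int
exception_io_status(bool still_mounted, bool is_write, bool data_accessible)
{
    if (still_mounted)
        return (EIO);       /* every read or write fails */
    if (is_write)
        return (EIO);       /* writes always fail until fsck runs */
    return (data_accessible ? 0 : EIO);
}
```

Allowing reads after the release gives an administrator a chance to salvage data before fsck repairs the file system and clears the exception state.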

[0082] When fsck repairs the file system, it takes the metatrans device 32 out of its exception state. Fsck first issues
a project-private ioctl that rolls the log up to the first error, discards the rest of the log, and makes the device writable.
After repairing the file system, fsck issues a project-private ioctl that takes the device out of its exception state. At boot
time, the logging device 46 is scanned and the metatrans device 32's internal state is rebuilt. A device error during the
scan puts the metatrans device 32 in the exception state. The scan continues if possible. An unreadable sector resulting
from an interrupted write is repaired by rewriting it; in that case, the metatrans device 32 is not put into an exception state.
[0083] Roll forward operations may happen while scanning the logging device 46 and rebuilding the internal state.
Roll forward operations happen because map memory may exceed its recommended allocation. Errors during these
roll forward operations put the metatrans device 32 into an exception state and the scan continues if possible.
[0084] It is recognized that delayed recording of local directory updates can improve performance. Two mechanisms
for differentiating local and remote (NFS) directory operations may be implemented: a) UFS can examine the p_as
member of the proc structure (if it is null, then the caller is a system process, presumably NFS; otherwise the operation
has been initiated by a user-level process and is taken to be local); or b) a new flag can be added to the Vnode operations for
directories that specifies whether or not the operation must be synchronous (or a new flag can be added to the thread structure).
[0085] Resources associated with open but deleted files must be reclaimed after a system crash and the present
invention includes a kernel thread for this purpose. However, a thread that always searches the entire file system for
such files has two disadvantages: the overhead of searching and the possibly noticeable delay until space is found
and recovered. An alternative is to use a spare field in the superblock to optimize the case where there are no such
files, which would likely be a fairly common occurrence.

[0086] The FIOSDIO ioctl puts the UFS file system into delayed IO mode, which means that local directory updates
are written to disk with delayed writes. Remote directory updates remain synchronous, as required by the NFS protocol.
This mode makes directory operations very fast but, without the present invention, it is unsafe, and repairing a file system
in DIO mode will usually require user intervention. The logging mechanism of the present invention ameliorates the
danger. To improve directory update performance, file systems may be placed into delayed IO mode unless the "nosyncdir"
mount option is specified. However, the implementation of delayed IO mode changes considerably and a solution
is to avoid use of the FIOSDIO flag and instead use a different, specific flag. This specific flag might be administered
by a new utility and a project-private UFS ioctl. The new flag could be stored in the superblock or could be stored in
MDD's database. The FIOSDIO ioctl would then have no effect on a file system in accordance with the present invention.

UFS Interface to Metatrans Device 


[0087] A metatrans device 32 records itself with UFS when the metatrans device 32 is created or is recreated at boot: 

    struct ufstrans *
    ufs_trans_set(
        dev_t dev,
        struct ufstransops *ops,
        void *data)


dev is the metatrans device number, data is the address of a metatrans-private structure, ops is the address of the 
branch table: 

    struct ufstransops {
        int  (*trans_begin)(struct ufstrans *, top_t, u_long, u_long);
        void (*trans_end)(struct ufstrans *, top_t, u_long, u_long);
        void (*trans_delta)(struct ufstrans *, off_t, off_t, delta_t, int (*)(), u_long);
        void (*trans_cancel)(struct ufstrans *, off_t, off_t, delta_t);
        int  (*trans_log)(struct ufstrans *, char *, off_t, off_t);
        void (*trans_mount)(struct ufstrans *, struct fs *);
        void (*trans_unmount)(struct ufstrans *, struct fs *);
        void (*trans_remount)(struct ufstrans *, struct fs *);
        void (*trans_iget)(struct ufstrans *, struct inode *);
        void (*trans_free_iblk)(struct ufstrans *, struct inode *, daddr_t);
        void (*trans_free)(struct ufstrans *, struct inode *, daddr_t, u_long);
        void (*trans_alloc)(struct ufstrans *, struct inode *, daddr_t, u_long, int);
    };



ufs_trans_set stores the above information in a singly linked list of:

    struct ufstrans {
        struct ufstrans     *ut_next;           /* next item in list */
        dev_t               ut_dev;             /* metatrans device no. */
        struct ufstransops  *ut_ops;            /* metatrans ops */
        struct vfs          *ut_vfsp;           /* XXX for inode pushes */
        void                *ut_data;           /* private data (?) */
        void                (*ut_onerror)();    /* callback ufs on error */
        int                 ut_onerror_state;   /* fs specific state */
    };



ufs_trans_reset() unlinks and frees the ufstrans structure. ufs_trans_reset () is called when a metatrans device is 
cleared. 

[0088] At mount time, UFS stores the address of a ufstrans structure in the vfs_trans field of a struct ufsvfs:

    ufsvfsp->vfs_trans = ufs_trans_get(dev, vfsp, ufs_trans_onerror, ufs_trans_onerror_state);

ufs_trans_get() returns NULL when the file system is not on a metatrans device 32. ufs_trans_onerror is called by the
metatrans device 32 when a fatal device error occurs. ufs_trans_onerror_state is stored as part of the metatrans device
32's error state. This error state is queried and reset by fsck and quotacheck.

[0089] UFS calls the metatrans device via the ufstransops table. These calls are buried inside of the following macros: 
    /*
     * vfs_trans == NULL means no metatrans device
     */
    #define TRANS_ISTRANS(ufsvfsp)  (ufsvfsp->vfs_trans)

    /*
     * begin a transaction
     */
    #define TRANS_BEGIN(ufsvfsp, vid, vsize, flag) \
        ((TRANS_ISTRANS(ufsvfsp)) ? \
            (*ufsvfsp->vfs_trans->ut_ops->trans_begin) \
                (ufsvfsp->vfs_trans, vid, vsize, flag) : 0)

    /*
     * end a transaction
     */
    #define TRANS_END(ufsvfsp, vid, vsize, flag) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_end) \
                (ufsvfsp->vfs_trans, vid, vsize, flag)

    /*
     * record a delta
     */
    #define TRANS_DELTA(ufsvfsp, mof, nb, dtyp, func, arg) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_delta) \
                (ufsvfsp->vfs_trans, mof, nb, dtyp, func, arg)

    /*
     * cancel a delta
     */
    #define TRANS_CANCEL(ufsvfsp, mof, nb, dtyp) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_cancel) \
                (ufsvfsp->vfs_trans, mof, nb, dtyp)

    /*
     * log a delta
     */
    #define TRANS_LOG(ufsvfsp, va, mof, nb) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_log) \
                (ufsvfsp->vfs_trans, va, mof, nb)

    /*
     * The following macros provide a more readable interface to TRANS_DELTA
     */
    #define TRANS_BUF(ufsvfsp, vof, nb, bp, type) \
        TRANS_DELTA(ufsvfsp, \
            dbtob(bp->b_blkno) + vof, nb, type, \
            ufs_trans_push_buf, bp->b_blkno)

    #define TRANS_BUF_ITEM(ufsvfsp, item, base, bp, type) \
        TRANS_DELTA(ufsvfsp, \
            (caddr_t)&(item) - (caddr_t)(base), \
            sizeof (item), bp, type)

    #define TRANS_INODE(ufsvfsp, vof, nb, ip) \
        TRANS_DELTA(ufsvfsp, ip->i_doff + vof, \
            nb, DT_INODE, ufs_trans_push_inode, ip)

    #define TRANS_INODE_ITEM(ufsvfsp, item, ip) \
        TRANS_INODE(ufsvfsp, (caddr_t)&(item) - (caddr_t)&ip->i_ic, sizeof (item), ip)

    #define TRANS_SI(ufsvfsp, fs, cg) \
        TRANS_DELTA(ufsvfsp, \
            dbtob(fsbtodb(fs, fs->fs_csaddr)) + \
            (caddr_t)&fs->fs_cs(fs, cg) - (caddr_t)fs->fs_csp[0], \
            sizeof (struct csum), DT_SI, ufs_trans_push_si, cg)

    #define TRANS_SB(ufsvfsp, item, fs) \
        TRANS_DELTA(ufsvfsp, \
            dbtob(SBLOCK) + ((caddr_t)&(item) - (caddr_t)fs), \
            sizeof (item), DT_SB, ufs_trans_push_sb, 0)



    /*
     * These functions "wrap" functions that are not VOP or VFS
     * entry points but must still use the TRANS_BEGIN/TRANS_END
     * protocol
     */
    #define TRANS_SBUPDATE(ufsvfsp, vfsp, topid) \
        ufs_trans_sbupdate(ufsvfsp, vfsp, topid)
    #define TRANS_SYNCIP(ip, bflags, iflag, topid) \
        ufs_trans_syncip(ip, bflags, iflag, topid)
    #define TRANS_SBWRITE(ufsvfsp, topid)   ufs_trans_sbwrite(ufsvfsp, topid)
    #define TRANS_IUPDAT(ip, waitfor)       ufs_trans_iupdat(ip, waitfor)
    #define TRANS_PUTPAGES(vp, off, len, flags, cred) \
        ufs_trans_putpages(vp, off, len, flags, cred)



    /*
     * Test/Debug ops
     * The following ops maintain the metadata map.
     */
    #define TRANS_IGET(ufsvfsp, ip) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_iget) \
                (ufsvfsp->vfs_trans, ip)

    #define TRANS_FREE_IBLK(ufsvfsp, ip, bn) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_free_iblk) \
                (ufsvfsp->vfs_trans, ip, bn)

    #define TRANS_FREE(ufsvfsp, ip, bno, size) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_free) \
                (ufsvfsp->vfs_trans, ip, bno, size)

    #define TRANS_ALLOC(ufsvfsp, ip, bno, size, zero) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_alloc) \
                (ufsvfsp->vfs_trans, ip, bno, size, zero)

    #define TRANS_MOUNT(ufsvfsp, fsp) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_mount) \
                (ufsvfsp->vfs_trans, fsp)

    #define TRANS_UMOUNT(ufsvfsp, fsp) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_unmount) \
                (ufsvfsp->vfs_trans, fsp)

    #define TRANS_REMOUNT(ufsvfsp, fsp) \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*ufsvfsp->vfs_trans->ut_ops->trans_remount) \
                (ufsvfsp->vfs_trans, fsp)



[0090] Besides the vfs_trans field in the ufsvfs struct, a new field, off_t i_doff, is added to the *incore* inode, struct
inode. i_doff is set in ufs_iget(). i_doff is the device offset for the inode's dinode. i_doff reduces the amount of code for
the TRANS_INODE() and TRANS_INODE_ITEM() macros. Similarly, the field dq_doff is added to the *incore* quota
structure, struct dquot.

[0091] The protocol between ufs_iinactive() and ufs_iget() is changed because the system deadlocks if an operation
on fs A causes a transaction on fs B. This happens in ufs_iinactive() when it frees an inode or when it calls ufs_syncip().
This happens in ufs_iget() when it calls ufs_syncip() on an inode from the free list. In the implementation of the
present invention, a thread cleans and moves idle inodes from its idle queue to a new 'really-free' list. The inodes on
the 'really-free' list are truly free and contain no state. In fact, they are merely portions of memory that happen to be
the right size for an inode. ufs_iget() uses inodes off this list or kmem_alloc()'s new inodes.

[0092] The thread runs when the number of inodes on its queue exceeds 25% of ufs_ninode. ufs_ninode is the
user-suggested maximum number of inodes in the inode cache. Note that ufs_ninode does not limit the size of the
inode cache. The number of active inodes and the number of idle inodes with pages may remain unbounded. The
thread will clean inodes until its queue length is less than 12.5% of ufs_ninode.
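
The two thresholds in paragraph [0092] form a simple hysteresis: cleaning starts above 25% of ufs_ninode and stops below 12.5%. A minimal sketch follows; idle_should_start() and idle_clean() are hypothetical names, and decrementing a counter stands in for cleaning one inode and moving it to the 'really-free' list.

```c
/*
 * Nonzero when the idle queue is long enough to wake the cleaning
 * thread, i.e. when it exceeds 25% of ufs_ninode.
 */
static int
idle_should_start(long qlen, long ufs_ninode)
{
    return (qlen * 4 > ufs_ninode);
}

/*
 * Clean inodes one at a time until the queue length is less than
 * 12.5% of ufs_ninode. Returns the resulting queue length.
 */
static long
idle_clean(long qlen, long ufs_ninode)
{
    while (qlen > 0 && qlen * 8 >= ufs_ninode)
        qlen--;     /* stand-in for cleaning and freeing one inode */
    return (qlen);
}
```

The gap between the start and stop thresholds keeps the thread from being woken repeatedly when the queue length hovers around a single limit.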
[0093] Some new counters may be added to the inode stats structure:

    /* Statistics on inodes */
    struct instats {
        int in_hits;                    /* Cache hits */
        int in_misses;                  /* Cache misses */
        int in_malloc;                  /* kmem_allocated */
        int in_mfree;                   /* kmem_free'd */
        int in_maxsize;                 /* Largest size reached by cache */
        int in_frfront;                 /* put at front of freelist */
        int in_frback;                  /* put at back of freelist */
        int in_dnlclook;                /* examined in dnlc */
        int in_dnlcpurge;               /* purged from dnlc */
        int in_inactive;                /* inactive calls */
        int in_inactive_nop;            /* inactive calls that nop'ed */
        int in_inactive_null;           /* inactive calls with null vfsp */
        int in_inactive_delay_free;     /* inactive delayed free's */
        int in_inactive_free;           /* inactive q's to free thread */
        int in_inactive_idle;           /* inactive q's to idle thread */
        int in_inactive_wakeups;        /* wakeups */
        int in_scan;                    /* calls to scan */
        int in_scan_scan;               /* inodes found */
        int in_scan_rwfail;             /* inode rw_tryenter's that failed */
    };

[0094] ufs_iinactive() frees the on-disk resources held by deleted files. Freeing inodes in ufs_iinactive() can deadlock
the system as described above, and the same solution may be used; that is, deleted files are processed by a thread.
The thread's queue is limited to ufs_ninode entries. ufs_rmdir() and ufs_remove() enforce the limit.
[0095] The system deadlocks if a thread holds the inode cache's lock when it is suspended while entering a transaction.
A thread suspends entering a transaction if there isn't sufficient log space at that time. The inode scan functions
ufs_flushi, ufs_iflush, and ufs_flush_inodes use a single scan-inode-hash function that doesn't hold the inode cache lock
while the per-inode function runs:

/*
 * scan the hash of inodes and call func with the inode locked
 */
int
ufs_scan_inodes(int rwtry, int (*func)(struct inode *, void *), void *arg)
{
    struct inode    *ip, *lip;
    struct vnode    *vp;
    union ihead     *ih;
    int             error;
    int             saverror = 0;
    extern krwlock_t icache_lock;

    ins.in_scan++;

    rw_enter(&icache_lock, RW_READER);
    for (ih = ihead; ih < &ihead[INOHSZ]; ih++) {
        for (ip = ih->ih_chain[0], lip = NULL;
            ip != (struct inode *)ih;
            ip = lip->i_forw) {

            ins.in_scan_scan++;

            /*
             * Hold the vnode so the inode stays valid, then drop
             * the cache lock; the hold on the previous inode (lip)
             * is released only after the current one is held.
             */
            vp = ITOV(ip);
            VN_HOLD(vp);
            rw_exit(&icache_lock);
            if (lip)
                VN_RELE(ITOV(lip));
            lip = ip;

            /*
             * Acquire the contents lock to make sure that the
             * inode has been initialized in the cache.
             */
            if (rwtry) {
                if (!rw_tryenter(&ip->i_contents, RW_WRITER)) {
                    ins.in_scan_rwfail++;
                    rw_enter(&icache_lock, RW_READER);
                    continue;
                }
            } else
                rw_enter(&ip->i_contents, RW_WRITER);
            rw_exit(&ip->i_contents);

            /*
             * i_number == 0 means bad initialization; ignore
             */
            if (ip->i_number)
                if ((error = (*func)(ip, arg)) != 0)
                    saverror = error;

            rw_enter(&icache_lock, RW_READER);
        }
        if (lip) {
            rw_exit(&icache_lock);
            VN_RELE(ITOV(lip));
            rw_enter(&icache_lock, RW_READER);
        }
    }
    rw_exit(&icache_lock);
    return (saverror);
}

[0096] ufs_iget uses the same protocol. This protocol is possible because the new iget/iinactive protocol obviates
the problems inherent in attempting to reuse a cached inode.

[0097] The lockfs flush routine, ufs_flush_inodes, is altered to effectuate the present invention. Previously,
ufs_flush_inodes hid inodes while flushing them. The inodes were hidden by taking them out of the inode cache,
flushing them, and then putting them back into the cache. However, hidden inodes cannot be found at the end of
transactions. ufs_flush_inodes now uses the new inode hash scan function to flush inodes.

[0098] ufs_unmount() is modified to use the lockfs protocol and the new inode hash scan function. ufs_unmount also
manages the UFS threads. All of the threads are created, controlled, and destroyed by a common set of routines in
ufs_thread.c. Each thread is represented by the structure:

/*
 * each ufs thread is managed by this struct (ufs_thread.c)
 */
struct ufs_q {
    void        *uq_head;       /* first entry on q */
    void        *uq_tail;       /* last entry on q */
    long        uq_ne;          /* # of entries */
    long        uq_maxne;       /* thread runs when ne==maxne */
    u_short     uq_nt;          /* # of threads serving this q */
    u_short     uq_nf;          /* # of flushes requested */
    u_short     uq_flags;       /* flags */
    kcondvar_t  uq_cv;          /* for sleep/wakeup */
    kmutex_t    uq_mutex;       /* protects this struct */
};

[0099] With reference to the following pseudocode listing, the single transaction technique for a journaling file system
of a computer operating system may be further understood.

SINGLE_TRANSACTION:

[0100]

If single transaction is closed
    wait for next single transaction to open
Enter transaction
Perform the synchronous operation
Close this single transaction
Wait for all current sync operations to finish
Commit all sync operations with single disk write
Open next single transaction
Leave transaction
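The pseudocode above can be modelled as a small state machine. The sketch below is deliberately single-threaded so that it stays self-contained: where the pseudocode says "wait", the functions return 0 and the caller is expected to retry or block. struct single_trans, st_enter() and st_finish() are hypothetical names, and the counters exist only to make the single-write property observable; a kernel implementation would use a mutex and condition variable instead of return codes.

```c
#include <assert.h>

/* Hypothetical model of the single transaction. */
struct single_trans {
    int  closed;            /* nonzero: no new operation may enter */
    int  active;            /* operations entered and still running */
    int  pending;           /* operations awaiting the next commit */
    long disk_writes;       /* commits issued, one log write each */
    long ops_committed;     /* operations made durable so far */
};

/*
 * "If single transaction is closed, wait ...; Enter transaction".
 * Returns 1 if the caller entered, 0 if it would have to wait.
 */
static int st_enter(struct single_trans *t)
{
    if (t->closed)
        return 0;
    t->active++;
    t->pending++;
    return 1;
}

/*
 * Called after an operation has performed its updates.
 * Returns 1 if this caller issued the commit, or 0 if it must wait
 * for the remaining operations in the same transaction to finish.
 */
static int st_finish(struct single_trans *t)
{
    assert(t->active > 0);
    t->closed = 1;                      /* close this single transaction */
    if (--t->active > 0)
        return 0;                       /* others are still running */
    t->disk_writes++;                   /* one write commits them all */
    t->ops_committed += t->pending;
    t->pending = 0;
    t->closed = 0;                      /* open next single transaction */
    return 1;
}
```

In this model, three operations that enter before any of them finishes are committed by one write, and an operation that arrives after the transaction has been closed is held back until the next one opens.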

[0101] UFS tells the metatrans device when transactions begin and end with the macros:

TRANS_BEGIN(ufsvfsp, vop_id, vop_size, &vop_flag);
TRANS_END(ufsvfsp, vop_id, vop_size, &vop_flag);

vop_id identifies the operation; for example, VA_MOUNT for mount() and VA_READ for read(). vop_size is an upper
bound on the amount of log space this transaction will need. vop_flag tells the metatrans driver whether this thread
must wait for the transaction to be committed, and whether this thread can sleep.
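To illustrate how an upper bound such as vop_size can gate entry into a transaction, the following sketch models log space as a simple reservation counter. It is an assumption-laden model rather than the metatrans driver: struct log_model, model_trans_begin() and model_trans_end() are hypothetical, a return value of 0 stands in for suspending the thread, and releasing the reservation at the end of the transaction is a simplification that the text does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the log's space accounting. */
struct log_model {
    size_t capacity;        /* total log space */
    size_t reserved;        /* space promised to running transactions */
};

/*
 * Reserve vop_size bytes of log space. Returns 1 if the transaction
 * may begin, or 0 where a real thread would be suspended until space
 * becomes available.
 */
static int model_trans_begin(struct log_model *lg, size_t vop_size)
{
    assert(lg->reserved <= lg->capacity);
    if (vop_size > lg->capacity - lg->reserved)
        return 0;
    lg->reserved += vop_size;
    return 1;
}

/* Return the reservation made by the matching model_trans_begin(). */
static void model_trans_end(struct log_model *lg, size_t vop_size)
{
    assert(lg->reserved >= vop_size);
    lg->reserved -= vop_size;
}
```

Passing the same vop_size to both calls mirrors the way TRANS_BEGIN and TRANS_END take identical arguments.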

[0102] Table 1 (below) lists the "commit" and "NFS commit" assertions for various system calls. Fundamentally,
using the technique of the present invention, transacted operations will not cause synchronous writes if they do not
require a commit, and those transacted operations that do require a commit will generate fewer synchronous writes.
[0103] As can be seen in Table 1, some transacted operations do not require a commit unless they originate on an
NFS client. Nevertheless, even the NFS-only-commit operations require a commit if the file system is mounted with
the -syncdir option. The operations that do not require a commit can be lost if the system goes down. These operations
are "committed" along with the next committed operation, for example at the next sync.
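The commit rule described in the two paragraphs above reduces to a three-way classification. The sketch below encodes it; enum commit_class and needs_commit() are hypothetical names introduced for illustration, and how each operation is classified is left to Table 1.

```c
#include <assert.h>

/* Illustrative classification of a transacted operation. */
enum commit_class {
    COMMIT_NEVER,       /* neither column of Table 1 is marked */
    COMMIT_NFS_ONLY,    /* only the "NFS Commit" column is marked */
    COMMIT_ALWAYS       /* the "Commit" column is marked */
};

/*
 * Decide whether an operation must wait for a commit.
 * from_nfs: the request originated on an NFS client.
 * syncdir:  the file system is mounted with the -syncdir option.
 */
static int needs_commit(enum commit_class cls, int from_nfs, int syncdir)
{
    switch (cls) {
    case COMMIT_ALWAYS:
        return 1;
    case COMMIT_NFS_ONLY:
        return from_nfs || syncdir;
    case COMMIT_NEVER:
    default:
        return 0;
    }
}
```

An operation such as TOP_WRITE_SYNC, which is marked in both columns of Table 1, would be COMMIT_ALWAYS, while an unmarked operation such as TOP_READ would be COMMIT_NEVER.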

[0104] Concurrent file system operations are combined into a single transaction. The file system operations needing
a commit will not return until all of the file system operations are complete. The file system operations that do not
require a commit will return immediately.

[0105] A file system operation may be suspended if its log space needs cannot be met, and UFS may split writes into
multiple transactions if the log is too small. Likewise, UFS may split truncations into multiple transactions if the
log is too small.

TABLE 1.

System Call                 Commit    NFS Commit
TOP_OPEN
TOP_CLOSE
TOP_READ
TOP_WRITE                             Y
TOP_WRITE_SYNC              Y         Y
TOP_GETATTR
TOP_SETATTR                           Y
TOP_SETATTR_TRUNC                     Y
TOP_ACCESS
TOP_LOOKUP
TOP_CREATE                            Y
TOP_REMOVE                            Y
TOP_LINK                              Y
TOP_RENAME                            Y
TOP_MKDIR                             Y
TOP_RMDIR                             Y
TOP_READDIR
TOP_SYMLINK                           Y
TOP_READLINK
TOP_FSYNC                             Y
TOP_INACTIVE
TOP_FID
TOP_GETPAGE
TOP_PUTPAGE
TOP_MAP
TOP_FRLOCK
TOP_SPACE                             Y
TOP_PATHCONF
TOP_VGET
TOP_SBUPDATE_FLUSH
TOP_SBUPDATE_UPDATE
TOP_SBUPDATE_MOUNTROOT
TOP_SBUPDATE_UNMOUNT
TOP_SYNCIP_CLOSEDQ
TOP_SYNCIP_TRYPAGE
TOP_SYNCIP_FLUSHI
TOP_SYNCIP_HLOCK
TOP_SYNCIP_SYNC
TOP_SYNCIP_FREE
TOP_SYNCIP_FSYNC                      Y
TOP_SBWRITE_FIOSDIO
TOP_SBWRITE_CHECKCLEAN
TOP_SBWRITE_RECLAIM         Y         Y
TOP_SBWRITE_T_RECLAIM       Y         Y
TOP_SBWRITE_NOTCLEAN        Y         Y
TOP_IFREE
TOP_IUPDAT
TOP_MOUNT
TOP_COMMIT_FLUSH
TOP_COMMIT_UPDATE
TOP_COMMIT_UNMOUNT

[0106] While there have been described above the principles of the present invention in conjunction with specific 
computer operating systems, the foregoing description is made only by way of example and not as a limitation to the 
scope of the invention, as defined by the following claims. 



Claims 

1. A method for writing data to a computer mass storage device in a single write operation in conjunction with a
computer operating system having a journaling file system, said method comprising the steps of:

providing for opening a single file system transaction for accumulating a plurality of current synchronous file
system operations;

providing for performing said plurality of current synchronous file system operations;

providing for closing said single file system transaction upon completion of a last of said current file system
operations; and

providing for committing said single file system transaction containing said plurality of current synchronous
file system operations to said computer mass storage device in a single write operation.

2. The method of claim 1, wherein prior to said step of providing for opening, the method additionally comprises the
steps of:

providing for entering a first of said synchronous file system operations; and

providing for waiting for said single file system transaction to be opened.

3. The method of claim 1, wherein said step of providing for committing further comprises the step of:

providing for writing said single file system transaction containing said plurality of current synchronous file
system operations to said computer mass storage device.

4. The method of claim 1, wherein said steps of providing for opening, closing and committing are carried out by
means of a metatrans device (32) coupling a file system layer (30) of said operating system to a driver (58) for
said computer mass storage device.

5. A computer (1) including a computer operating system loadable thereon for running application programs, a
computer mass storage device (9) associated with said computer for receiving data in response to a journaling file
system of said operating system, wherein said operating system comprises:

a transaction layer (34) responsive to said journaling file system for creating a single file system transaction
for accumulating a plurality of synchronous file system operations;

a file system layer (30) responsive to said transaction device (34) for performing said plurality of current
synchronous file system operations; and

a metatrans driver (32) responsive to said transaction layer (34) and said file system layer (30) for committing
said single file system transaction to said computer mass storage device in a single write operation.

6. The computer of claim 5, wherein said metatrans driver (32) comprises a log device (46) responsive to said
transaction layer for storing data entries representative of said plurality of current synchronous file system operations.

7. The computer of claim 6, wherein said metatrans driver (32) further comprises a write buffer (50) coupled to said
log device (46) for accumulating said data entries.

8. The computer of claim 7, wherein said metatrans driver (32) further comprises a delta map (48) coupling said file
system layer (30) to said write buffer (50) for recording information corresponding to changes in each of said
plurality of synchronous file system operations.

9. The computer of claim 8, wherein said metatrans driver (32) further comprises a log map (54) coupled to said log
device (46) and said write buffer (50) for storing information corresponding to entries in said write buffer.



Claims (German)

1. A method for writing data to a computer mass storage device in a single write operation in conjunction with a
computer operating system having a journaling file system, the method comprising the following steps:

opening a single file system transaction for accumulating a plurality of current synchronous file system
operations;

performing the plurality of current synchronous file system operations;

closing the single file system transaction upon completion of a last of the current file system operations; and

saving the single file system transaction, which contains a plurality of current synchronous file system
operations, in the computer mass storage device in a single write operation.

2. The method of claim 1, additionally comprising the following steps prior to the opening step:

entering a first of the synchronous file system operations; and

waiting for the single file system transaction to be opened.

3. The method of claim 1, wherein the saving step further comprises the following step:

writing the single file system transaction, which contains the plurality of current synchronous file system
operations, to the computer mass storage device.

4. The method of claim 1, wherein the opening, closing and saving steps are carried out by means of a metatrans
device (32) which couples a file system layer (30) of the operating system to a driver (58) for the computer mass
storage device.

5. A computer (1) comprising a computer operating system that can be loaded onto it in order to run application
programs, and a computer mass storage device (9) which is associated with the computer and receives data in
response to a journaling file system of the operating system, wherein the operating system comprises:

a transaction layer (34) responsive to the journaling file system for creating a single file system transaction
for accumulating a plurality of synchronous file system operations;

a file system layer (30) responsive to the transaction device (34) for performing the plurality of current
synchronous file system operations; and

a metatrans driver (32) responsive to the transaction layer (34) and to the file system layer (30) for saving the
single file system transaction in the computer mass storage device in a single write operation.

6. The device of claim 5, wherein the metatrans driver (32) comprises a log device (46) responsive to the transaction
layer for storing data entries representing the plurality of current synchronous file system operations.

7. The computer of claim 6, wherein the metatrans driver (32) further comprises a write buffer (50) coupled to the
log device (46) for accumulating the data entries.

8. The computer of claim 7, wherein the metatrans driver (32) further comprises a delta map (48) which couples the
file system layer (30) to the write buffer (50) for recording information corresponding to changes in each of the
plurality of synchronous file system operations.

9. The computer of claim 8, wherein the metatrans driver (32) further comprises a log map (54) coupled to the log
device (46) and to the write buffer (50) for storing information corresponding to the entries in the write buffer.

Claims (French)

1. A method of writing data to a computer mass storage device in a single write operation, in conjunction with a
computer operating system comprising a journaling file system, said method comprising the steps of:

providing for the opening of a single file system transaction in order to group a plurality of current synchronous
file system operations;

providing for the execution of said plurality of current synchronous file system operations;

providing for the closing of said single file system transaction upon completion of a last of said current file
system operations; and

providing for the recording of said single file system transaction containing said plurality of current synchronous
file system operations on said computer mass storage device in a single write operation.

2. The method of claim 1, wherein, prior to said opening step, the method further comprises the steps of:

providing for the entry of a first of said synchronous file system operations; and

providing for waiting for the opening of said single file system transaction.

3. The method of claim 1, wherein said recording step further comprises the step of:

providing for the writing of said single file system transaction containing said plurality of current synchronous
file system operations to said computer mass storage device.

4. The method of claim 1, wherein the opening, closing and recording steps are carried out by means of a "metatrans"
device (transaction metadevice) (32) coupling a file system layer (30) of said operating system to a driver (58) of
said computer mass storage device.

5. A computer (1) comprising a computer operating system loadable in order to run application programs, a computer
mass storage device (9) associated with said computer in order to receive data in response to a journaling file
system of said operating system, wherein said operating system comprises:

a transaction layer (34) responsive to said journaling file system for creating a single file system transaction
in order to group a plurality of synchronous file system operations;

a file system layer (30) responsive to said transaction device (34) for executing said plurality of current
synchronous file system operations; and

a "metatrans" driver (32) responsive to said transaction layer (34) and to said file system layer (30) for recording
said single file system transaction on said computer mass storage device in a single write operation.

6. The computer of claim 5, wherein said "metatrans" driver (32) comprises a log device (46) responsive to said
transaction layer for storing data entries representative of said plurality of current synchronous file system
operations.

7. The computer of claim 6, wherein said "metatrans" driver (32) further comprises a write buffer (50) coupled to said
log device (46) in order to group said data entries.

8. The computer of claim 7, wherein said "metatrans" driver (32) further comprises a delta map (48) coupling said file
system layer (30) to said write buffer (50) in order to record information corresponding to modifications in each of
said plurality of synchronous file system operations.

9. The computer of claim 8, wherein said "metatrans" driver (32) further comprises a log map (54) coupled to said log
device (46) and to said write buffer (50) in order to store information corresponding to entries in said write buffer.